In [1]:
%matplotlib inline
import pandas as pd
import numpy as np

from IPython.core.display import HTML
def css_styling():
    styles = open("styles/custom.css", "r").read()
    return HTML(styles)
css_styling()


Out[1]:

Data Preparation using pandas

An initial step in statistical data analysis is the preparation of the data to be used in the analysis. In practice, much, if not the majority, of the actual time spent on a statistical modeling project is typically devoted to importing, cleaning, validating and transforming the dataset.

This section will introduce pandas, an important third-party Python package for data analysis, as a tool for data preparation, and provide some general advice for what should or should not be done to data before it is analyzed.

Introduction to pandas

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with relational or labeled data both easy and intuitive. It is a fundamental high-level building block for doing practical, real world data analysis in Python.

pandas is well suited for:

  • Tabular data with heterogeneously-typed columns, as you might find in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time series data
  • Arbitrary matrix data with row and column labels

Virtually any statistical dataset, labeled or unlabeled, can be converted to a pandas data structure for cleaning, transformation, and analysis.

Key features

  • Easy handling of missing data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes
  • Robust IO tools for loading data from flat files, Excel files, databases, and HDF5
  • Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
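
As a small taste of the first and last of these features (a sketch using made-up data, not drawn from the tutorial datasets):

ts = pd.Series([1.0, np.nan, 3.0, 4.0],
               index=pd.date_range('2013-01-01', periods=4, freq='D'))
ts.fillna(0)                  # easy handling of missing data
ts.resample('2D', how='sum')  # frequency conversion into two-day bins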

Series

A Series is a single vector of data (like a NumPy array) with an index that labels each element in the vector.


In [2]:
counts = pd.Series([632, 1638, 569, 115])
counts


Out[2]:
0     632
1    1638
2     569
3     115
dtype: int64

If an index is not specified, a default sequence of integers is assigned as the index. A NumPy array comprises the values of the Series, while the index is a pandas Index object.


In [3]:
counts.values


Out[3]:
array([ 632, 1638,  569,  115])

In [4]:
counts.index


Out[4]:
Int64Index([0, 1, 2, 3], dtype='int64')

We can assign meaningful labels to the index, if they are available. These counts are of bacteria taxa constituting the microbiome of hospital patients, so using the taxon of each bacterium is a useful index.


In [5]:
bacteria = pd.Series([632, 1638, 569, 115], 
    index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])

bacteria


Out[5]:
Firmicutes         632
Proteobacteria    1638
Actinobacteria     569
Bacteroidetes      115
dtype: int64

These labels can be used to refer to the values in the Series.


In [6]:
bacteria['Actinobacteria']


Out[6]:
569

In [7]:
bacteria[bacteria.index.str.endswith('bacteria')]


Out[7]:
Proteobacteria    1638
Actinobacteria     569
dtype: int64

In [8]:
'Bacteroidetes' in bacteria


Out[8]:
True

Notice that the indexing operation preserved the association between the values and the corresponding indices.

We can still use positional indexing if we wish.


In [9]:
bacteria[0]


Out[9]:
632

We can give both the array of values and the index meaningful labels themselves:


In [10]:
bacteria.name = 'counts'
bacteria.index.name = 'phylum'
bacteria


Out[10]:
phylum
Firmicutes         632
Proteobacteria    1638
Actinobacteria     569
Bacteroidetes      115
Name: counts, dtype: int64

NumPy's math functions and other operations can be applied to Series without losing the data structure.


In [11]:
np.log(bacteria)


Out[11]:
phylum
Firmicutes        6.448889
Proteobacteria    7.401231
Actinobacteria    6.343880
Bacteroidetes     4.744932
Name: counts, dtype: float64

We can also filter according to the values in the Series:


In [12]:
bacteria[bacteria>1000]


Out[12]:
phylum
Proteobacteria    1638
Name: counts, dtype: int64

A Series can be thought of as an ordered key-value store. In fact, we can create one from a dict:


In [13]:
bacteria_dict = {'Firmicutes': 632, 'Proteobacteria': 1638, 'Actinobacteria': 569, 'Bacteroidetes': 115}
bact = pd.Series(bacteria_dict)

In [14]:
bact


Out[14]:
Actinobacteria     569
Bacteroidetes      115
Firmicutes         632
Proteobacteria    1638
dtype: int64

Notice that the Series is created in key-sorted order.

If we pass a custom index to Series, it will select the corresponding values from the dict, and treat indices without corresponding values as missing. pandas uses the NaN (not a number) type for missing values.


In [15]:
bacteria2 = pd.Series(bacteria_dict, 
                      index=['Cyanobacteria','Firmicutes','Proteobacteria','Actinobacteria'])
bacteria2


Out[15]:
Cyanobacteria      NaN
Firmicutes         632
Proteobacteria    1638
Actinobacteria     569
dtype: float64

In [16]:
bacteria2.isnull()


Out[16]:
Cyanobacteria      True
Firmicutes        False
Proteobacteria    False
Actinobacteria    False
dtype: bool

Critically, the labels are used to align data when used in operations with other Series objects:


In [17]:
bacteria + bacteria2


Out[17]:
Actinobacteria    1138
Bacteroidetes      NaN
Cyanobacteria      NaN
Firmicutes        1264
Proteobacteria    3276
dtype: float64

Contrast this with NumPy arrays, where arrays of the same length combine values element-wise; adding Series combines values with the same label in the resulting series. Notice also that the missing values were propagated by addition.
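
To make the contrast concrete, here is a minimal sketch (with made-up values):

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
a + b    # array([11, 22, 33]) -- strictly positional

s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([10, 20], index=['b', 'c'])
s1 + s2  # 'a' and 'c' are NaN; only 'b', present in both, gets a value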

DataFrame

Inevitably, we want to be able to store, view and manipulate data that is multivariate, where for every index there are multiple fields or columns of data, often of varying data type.

A DataFrame is a tabular data structure, encapsulating multiple Series, like the columns in a spreadsheet. Data are stored internally as a 2-dimensional object, but the DataFrame allows us to represent and manipulate higher-dimensional data.


In [18]:
bacteria_data = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, 754, 555],
                     'patient':[1, 1, 1, 1, 2, 2, 2, 2],
                     'phylum':['Firmicutes', 'Proteobacteria', 'Actinobacteria', 
    'Bacteroidetes', 'Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes']})
bacteria_data


Out[18]:
patient phylum value
0 1 Firmicutes 632
1 1 Proteobacteria 1638
2 1 Actinobacteria 569
3 1 Bacteroidetes 115
4 2 Firmicutes 433
5 2 Proteobacteria 1130
6 2 Actinobacteria 754
7 2 Bacteroidetes 555

Notice the DataFrame is sorted by column name. We can change the order by indexing them in the order we desire:


In [19]:
bacteria_data[['phylum','value','patient']]


Out[19]:
phylum value patient
0 Firmicutes 632 1
1 Proteobacteria 1638 1
2 Actinobacteria 569 1
3 Bacteroidetes 115 1
4 Firmicutes 433 2
5 Proteobacteria 1130 2
6 Actinobacteria 754 2
7 Bacteroidetes 555 2

A DataFrame has a second index, representing the columns:


In [20]:
bacteria_data.columns


Out[20]:
Index(['patient', 'phylum', 'value'], dtype='object')

If we wish to access columns, we can do so either by dict-like indexing or by attribute:


In [21]:
bacteria_data['value']


Out[21]:
0     632
1    1638
2     569
3     115
4     433
5    1130
6     754
7     555
Name: value, dtype: int64

In [22]:
bacteria_data.value


Out[22]:
0     632
1    1638
2     569
3     115
4     433
5    1130
6     754
7     555
Name: value, dtype: int64

Using the standard indexing syntax for a single column of data from a DataFrame returns the column as a Series.


In [23]:
type(bacteria_data['value'])


Out[23]:
pandas.core.series.Series

Passing the column name as a list returns the column as a DataFrame instead.


In [24]:
bacteria_data[['value']]


Out[24]:
value
0 632
1 1638
2 569
3 115
4 433
5 1130
6 754
7 555

Notice that indexing works differently with a DataFrame than with a Series, where in the latter, dict-like indexing retrieved a particular element (row). If we want access to a row in a DataFrame, we index its ix attribute.


In [25]:
bacteria_data.ix[3]


Out[25]:
patient                1
phylum     Bacteroidetes
value                115
Name: 3, dtype: object

Since a row potentially contains different data types, the returned Series of values is of the generic object type.

If we want to create a DataFrame row-wise rather than column-wise, we can do so with a dict of dicts:


In [26]:
bacteria_data = pd.DataFrame({0: {'patient': 1, 'phylum': 'Firmicutes', 'value': 632},
                    1: {'patient': 1, 'phylum': 'Proteobacteria', 'value': 1638},
                    2: {'patient': 1, 'phylum': 'Actinobacteria', 'value': 569},
                    3: {'patient': 1, 'phylum': 'Bacteroidetes', 'value': 115},
                    4: {'patient': 2, 'phylum': 'Firmicutes', 'value': 433},
                    5: {'patient': 2, 'phylum': 'Proteobacteria', 'value': 1130},
                    6: {'patient': 2, 'phylum': 'Actinobacteria', 'value': 754},
                    7: {'patient': 2, 'phylum': 'Bacteroidetes', 'value': 555}})

In [27]:
bacteria_data


Out[27]:
0 1 2 3 4 5 6 7
patient 1 1 1 1 2 2 2 2
phylum Firmicutes Proteobacteria Actinobacteria Bacteroidetes Firmicutes Proteobacteria Actinobacteria Bacteroidetes
value 632 1638 569 115 433 1130 754 555

However, we probably want this transposed:


In [28]:
bacteria_data = bacteria_data.T
bacteria_data


Out[28]:
patient phylum value
0 1 Firmicutes 632
1 1 Proteobacteria 1638
2 1 Actinobacteria 569
3 1 Bacteroidetes 115
4 2 Firmicutes 433
5 2 Proteobacteria 1130
6 2 Actinobacteria 754
7 2 Bacteroidetes 555

Views

It's important to note that the Series returned when a DataFrame is indexed is merely a view on the DataFrame, and not a copy of the data itself. So you must be cautious when manipulating this data.

For example, let's isolate a column of our dataset by assigning it as a Series to a variable.


In [29]:
vals = bacteria_data.value
vals


Out[29]:
0     632
1    1638
2     569
3     115
4     433
5    1130
6     754
7     555
Name: value, dtype: object

Now, let's assign a new value to one of the elements of the Series.


In [30]:
vals[5] = 0
vals


Out[30]:
0     632
1    1638
2     569
3     115
4     433
5       0
6     754
7     555
Name: value, dtype: object

However, we may not anticipate that the value in the original DataFrame has also been changed!


In [31]:
bacteria_data


Out[31]:
patient phylum value
0 1 Firmicutes 632
1 1 Proteobacteria 1638
2 1 Actinobacteria 569
3 1 Bacteroidetes 115
4 2 Firmicutes 433
5 2 Proteobacteria 0
6 2 Actinobacteria 754
7 2 Bacteroidetes 555

We can avoid this by working with a copy when modifying subsets of the original data.


In [32]:
vals = bacteria_data.value.copy()
vals[5] = 1000

bacteria_data


Out[32]:
patient phylum value
0 1 Firmicutes 632
1 1 Proteobacteria 1638
2 1 Actinobacteria 569
3 1 Bacteroidetes 115
4 2 Firmicutes 433
5 2 Proteobacteria 0
6 2 Actinobacteria 754
7 2 Bacteroidetes 555

So, as we have seen, we can create or modify columns by assignment; let's put back the value we accidentally changed.


In [33]:
bacteria_data.value[5] = 1130

Or, we may wish to add a column representing the year the data were collected.


In [34]:
bacteria_data['year'] = 2013
bacteria_data


Out[34]:
patient phylum value year
0 1 Firmicutes 632 2013
1 1 Proteobacteria 1638 2013
2 1 Actinobacteria 569 2013
3 1 Bacteroidetes 115 2013
4 2 Firmicutes 433 2013
5 2 Proteobacteria 1130 2013
6 2 Actinobacteria 754 2013
7 2 Bacteroidetes 555 2013

But note, we cannot use the attribute indexing method to add a new column:


In [35]:
bacteria_data.treatment = 1
bacteria_data


Out[35]:
patient phylum value year
0 1 Firmicutes 632 2013
1 1 Proteobacteria 1638 2013
2 1 Actinobacteria 569 2013
3 1 Bacteroidetes 115 2013
4 2 Firmicutes 433 2013
5 2 Proteobacteria 1130 2013
6 2 Actinobacteria 754 2013
7 2 Bacteroidetes 555 2013

In [36]:
bacteria_data.treatment


Out[36]:
1

Auto-alignment

When adding a column that is not a simple constant, we need to be a bit more careful. Due to pandas' auto-alignment behavior, specifying a Series as a new column causes its values to be added according to the DataFrame's index:


In [37]:
treatment = pd.Series([0]*4 + [1]*2)

treatment


Out[37]:
0    0
1    0
2    0
3    0
4    1
5    1
dtype: int64

In [38]:
bacteria_data['treatment'] = treatment

bacteria_data


Out[38]:
patient phylum value year treatment
0 1 Firmicutes 632 2013 0
1 1 Proteobacteria 1638 2013 0
2 1 Actinobacteria 569 2013 0
3 1 Bacteroidetes 115 2013 0
4 2 Firmicutes 433 2013 1
5 2 Proteobacteria 1130 2013 1
6 2 Actinobacteria 754 2013 NaN
7 2 Bacteroidetes 555 2013 NaN

Other Python data structures (ones without an index) need to be the same length as the DataFrame:


In [40]:
month = ['Jan', 'Feb', 'Mar', 'Apr']
bacteria_data['month'] = month


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-40-56b358a0dde9> in <module>()
      1 month = ['Jan', 'Feb', 'Mar', 'Apr']
----> 2 bacteria_data['month'] = month

/usr/local/lib/python3.4/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
   2119         else:
   2120             # set column
-> 2121             self._set_item(key, value)
   2122 
   2123     def _setitem_slice(self, key, value):

/usr/local/lib/python3.4/site-packages/pandas/core/frame.py in _set_item(self, key, value)
   2196 
   2197         self._ensure_valid_index(value)
-> 2198         value = self._sanitize_column(key, value)
   2199         NDFrame._set_item(self, key, value)
   2200 

/usr/local/lib/python3.4/site-packages/pandas/core/frame.py in _sanitize_column(self, key, value)
   2354 
   2355             # turn me into an ndarray
-> 2356             value = _sanitize_index(value, self.index, copy=False)
   2357             if not isinstance(value, (np.ndarray, Index)):
   2358                 if isinstance(value, list) and len(value) > 0:

/usr/local/lib/python3.4/site-packages/pandas/core/series.py in _sanitize_index(data, index, copy)
   2570 
   2571     if len(data) != len(index):
-> 2572         raise ValueError('Length of values does not match length of '
   2573                          'index')
   2574 

ValueError: Length of values does not match length of index

In [41]:
bacteria_data['month'] = ['Jan']*len(bacteria_data)

bacteria_data


Out[41]:
patient phylum value year treatment month
0 1 Firmicutes 632 2013 0 Jan
1 1 Proteobacteria 1638 2013 0 Jan
2 1 Actinobacteria 569 2013 0 Jan
3 1 Bacteroidetes 115 2013 0 Jan
4 2 Firmicutes 433 2013 1 Jan
5 2 Proteobacteria 1130 2013 1 Jan
6 2 Actinobacteria 754 2013 NaN Jan
7 2 Bacteroidetes 555 2013 NaN Jan

We can use del to remove columns, in the same way dict entries can be removed:


In [42]:
del bacteria_data['month']

bacteria_data


Out[42]:
patient phylum value year treatment
0 1 Firmicutes 632 2013 0
1 1 Proteobacteria 1638 2013 0
2 1 Actinobacteria 569 2013 0
3 1 Bacteroidetes 115 2013 0
4 2 Firmicutes 433 2013 1
5 2 Proteobacteria 1130 2013 1
6 2 Actinobacteria 754 2013 NaN
7 2 Bacteroidetes 555 2013 NaN

Or employ the drop method.


In [43]:
bacteria_data.drop('treatment', axis=1)


Out[43]:
patient phylum value year
0 1 Firmicutes 632 2013
1 1 Proteobacteria 1638 2013
2 1 Actinobacteria 569 2013
3 1 Bacteroidetes 115 2013
4 2 Firmicutes 433 2013
5 2 Proteobacteria 1130 2013
6 2 Actinobacteria 754 2013
7 2 Bacteroidetes 555 2013

We can extract the underlying data as a simple ndarray by accessing the values attribute:


In [44]:
bacteria_data.values


Out[44]:
array([[1, 'Firmicutes', 632, 2013, 0.0],
       [1, 'Proteobacteria', 1638, 2013, 0.0],
       [1, 'Actinobacteria', 569, 2013, 0.0],
       [1, 'Bacteroidetes', 115, 2013, 0.0],
       [2, 'Firmicutes', 433, 2013, 1.0],
       [2, 'Proteobacteria', 1130, 2013, 1.0],
       [2, 'Actinobacteria', 754, 2013, nan],
       [2, 'Bacteroidetes', 555, 2013, nan]], dtype=object)

Notice that because of the mix of string, integer and float (and NaN) values, the dtype of the array is object. The dtype will automatically be chosen to be as general as needed to accommodate all the columns.


In [45]:
df = pd.DataFrame({'foo': [1,2,3], 'bar':[0.4, -1.0, 4.5]})

df.values, df.values.dtype


Out[45]:
(array([[ 0.4,  1. ],
        [-1. ,  2. ],
        [ 4.5,  3. ]]), dtype('float64'))

pandas uses a custom data structure to represent the indices of Series and DataFrames.


In [46]:
bacteria_data.index


Out[46]:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7], dtype='int64')

Index objects are immutable:


In [47]:
bacteria_data.index[0] = 15


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-47-d897e4dc5161> in <module>()
----> 1 bacteria_data.index[0] = 15

/usr/local/lib/python3.4/site-packages/pandas/core/index.py in __setitem__(self, key, value)
   1046 
   1047     def __setitem__(self, key, value):
-> 1048         raise TypeError("Indexes does not support mutable operations")
   1049 
   1050     def __getitem__(self, key):

TypeError: Indexes does not support mutable operations

This is so that Index objects can be shared between data structures without fear that they will be changed.


In [48]:
bacteria2.index = bacteria.index

bacteria2


Out[48]:
phylum
Firmicutes         NaN
Proteobacteria     632
Actinobacteria    1638
Bacteroidetes      569
dtype: float64

Exercise: Indexing

From the bacteria_data table above, create an index to return all rows for which the phylum name ends in "bacteria" and the value is greater than 1000.


In [49]:
# Write your answer here

Using pandas

In this section, we will import and clean up some of the datasets that we will be using later in the tutorial. In doing so, we will introduce the key functionality of pandas that is required to use the software effectively.

Importing data

A key, but often under-appreciated, step in data analysis is importing the data that we wish to analyze. Though it is easy to load basic data structures into Python using built-in tools or those provided by packages like NumPy, it is non-trivial to import structured data well, and to easily convert this input into a robust data structure:

genes = np.loadtxt("genes.csv", delimiter=",", dtype=[('gene', '|S10'), ('value', '<f4')])

pandas provides a convenient set of functions for importing tabular data in a number of formats directly into a DataFrame object. These functions include a slew of options to perform type inference, indexing, parsing, iterating and cleaning automatically as data are imported.

Delimited data

The file olympics.1996.txt in the data directory contains counts of medals awarded at the 1996 Summer Olympic Games by country, along with the countries' respective population sizes. This data is stored in a tab-separated format.


In [50]:
!head ../data/olympics.1996.txt


Tonga	1	96165
Bahamas	1	281584
Jamaica	6	2589043
Cuba	25	10952046
Australia	41	18348078
Hungary	21	10273590
Bulgaria	15	8181047
Trinidad & Tobago	2	1196910
New Zealand	6	3621200
Norway	7	4381275

This table can be read into a DataFrame using read_table.


In [51]:
medals = pd.read_table('../data/olympics.1996.txt', sep='\t',
                       index_col=0,
                       header=None, names=['country', 'medals', 'population'])
medals.head()


Out[51]:
medals population
country
Tonga 1 96165
Bahamas 1 281584
Jamaica 6 2589043
Cuba 25 10952046
Australia 41 18348078

There is no header row in this dataset, so we specified this, and provided our own header names. If we did not specify header=None the function would have assumed the first row contained column names.

The tab separator was passed to the sep argument as \t.

The sep argument can be customized as needed to accommodate arbitrary separators. For example, we can use a regular expression to define a variable amount of whitespace, which is unfortunately common in some datasets:

sep='\s+'
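
In full, such a call might look like this (a sketch, assuming a hypothetical whitespace-delimited file):

pd.read_table('../data/hypothetical.txt', sep='\s+', header=None)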

Scraping Data from the Web

We would like to add another variable to this dataset. Along with population, a country's economic development may be a useful predictor of Olympic success. A very simple indicator of this might be OECD membership status.

The OECD website contains a table listing OECD member nations, along with their year of membership. We would like to import this table and extract the countries that were members as of the 1996 games.

The read_html function accepts a URL argument, and will attempt to extract all the tables from that address, returning whatever it finds in a list of DataFrames.


In [52]:
oecd_site = 'http://www.oecd.org/about/membersandpartners/list-oecd-member-countries.htm'
pd.read_html(oecd_site)


Out[52]:
[                                                   0
 0  On 14 December 1960, 20 countries originally s...
 1  Here is a list of the current Member countries...,
                                      0                1                  2   3
 0                                  NaN          Country               Date NaN
 1                                  NaN        AUSTRALIA        7 June 1971 NaN
 2                                  NaN          AUSTRIA  29 September 1961 NaN
 3                                  NaN          BELGIUM  13 September 1961 NaN
 4                                  NaN           CANADA      10 April 1961 NaN
 5                                  NaN            CHILE         7 May 2010 NaN
 6                                  NaN   CZECH REPUBLIC   21 December 1995 NaN
 7                                  NaN          DENMARK        30 May 1961 NaN
 8                                  NaN          ESTONIA    9 December 2010 NaN
 9                                  NaN          FINLAND    28 January 1969 NaN
 10                                 NaN           FRANCE      7 August 1961 NaN
 11                                 NaN          GERMANY  27 September 1961 NaN
 12                                 NaN           GREECE  27 September 1961 NaN
 13                                 NaN          HUNGARY         7 May 1996 NaN
 14                                 NaN          ICELAND        5 June 1961 NaN
 15                                 NaN          IRELAND     17 August 1961 NaN
 16                                 NaN           ISRAEL   7 September 2010 NaN
 17                                 NaN            ITALY      29 March 1962 NaN
 18                                 NaN            JAPAN      28 April 1964 NaN
 19                                 NaN            KOREA   12 December 1996 NaN
 20                                 NaN       LUXEMBOURG    7 December 1961 NaN
 21                                 NaN           MEXICO        18 May 1994 NaN
 22                                 NaN      NETHERLANDS   13 November 1961 NaN
 23                                 NaN      NEW ZEALAND        29 May 1973 NaN
 24                                 NaN           NORWAY        4 July 1961 NaN
 25                                 NaN           POLAND   22 November 1996 NaN
 26                                 NaN         PORTUGAL      4 August 1961 NaN
 27                                 NaN  SLOVAK REPUBLIC   14 December 2000 NaN
 28                                 NaN         SLOVENIA       21 July 2010 NaN
 29                                 NaN            SPAIN      3 August 1961 NaN
 30                                 NaN           SWEDEN  28 September 1961 NaN
 31                                 NaN      SWITZERLAND  28 September 1961 NaN
 32                                 NaN           TURKEY      2 August 1961 NaN
 33                                 NaN   UNITED KINGDOM         2 May 1961 NaN
 34                                 NaN    UNITED STATES      12 April 1961 NaN
 35  More on membership and enlargement              NaN                NaN NaN]

There is typically some cleanup that is required of the returned data, such as the assignment of column names or conversion of types.

The table of interest is at index 1, and we will extract two columns from the table. Otherwise, this table is pretty clean.


In [53]:
oecd = pd.read_html(oecd_site, header=0)[1][[1,2]]
oecd.head()


Out[53]:
Country Date
0 AUSTRALIA 7 June 1971
1 AUSTRIA 29 September 1961
2 BELGIUM 13 September 1961
3 CANADA 10 April 1961
4 CHILE 7 May 2010

In [54]:
oecd['year'] = pd.to_datetime(oecd.Date).apply(lambda x: x.year)
oecd_year = oecd.set_index(oecd.Country.str.title())['year'].dropna()
oecd_year


Out[54]:
Country
Australia          1971
Austria            1961
Belgium            1961
Canada             1961
Chile              2010
Czech Republic     1995
Denmark            1961
Estonia            2010
Finland            1969
France             1961
Germany            1961
Greece             1961
Hungary            1996
Iceland            1961
Ireland            1961
Israel             2010
Italy              1962
Japan              1964
Korea              1996
Luxembourg         1961
Mexico             1994
Netherlands        1961
New Zealand        1973
Norway             1961
Poland             1996
Portugal           1961
Slovak Republic    2000
Slovenia           2010
Spain              1961
Sweden             1961
Switzerland        1961
Turkey             1961
United Kingdom     1961
United States      1961
Name: year, dtype: float64

We can create an indicator (binary) variable for OECD status by checking if each country is in the index of countries with membership year less than 1997.

The new DataFrame method assign is a convenient means for creating the new column from this operation.


In [55]:
medals_data = medals.assign(oecd=medals.index.isin((oecd_year[oecd_year<1997]).index).astype(int))

Since the distribution of populations spans several orders of magnitude, we may wish to use the logarithm of the population size, which may be created similarly.


In [56]:
medals_data = medals_data.assign(log_population=np.log(medals.population))

The NumPy log function will return a pandas Series (or DataFrame when applied to one) instead of a ndarray; all of NumPy's functions are compatible with pandas in this way.


In [57]:
medals_data.head()


Out[57]:
medals population oecd log_population
country
Tonga 1 96165 0 11.473821
Bahamas 1 281584 0 12.548186
Jamaica 6 2589043 0 14.766799
Cuba 25 10952046 0 16.209037
Australia 41 18348078 1 16.725035

Comma-separated Values (CSV)

The most common form of delimited data is comma-separated values (CSV). Since CSV is so ubiquitous, read_csv is available as a convenience wrapper around read_table.

Consider some more microbiome data.


In [58]:
!cat ../data/microbiome/microbiome.csv


Taxon,Patient,Tissue,Stool
Firmicutes,1,632,305
Firmicutes,2,136,4182
Firmicutes,3,1174,703
Firmicutes,4,408,3946
Firmicutes,5,831,8605
Firmicutes,6,693,50
Firmicutes,7,718,717
Firmicutes,8,173,33
Firmicutes,9,228,80
Firmicutes,10,162,3196
Firmicutes,11,372,32
Firmicutes,12,4255,4361
Firmicutes,13,107,1667
Firmicutes,14,96,223
Firmicutes,15,281,2377
Proteobacteria,1,1638,3886
Proteobacteria,2,2469,1821
Proteobacteria,3,839,661
Proteobacteria,4,4414,18
Proteobacteria,5,12044,83
Proteobacteria,6,2310,12
Proteobacteria,7,3053,547
Proteobacteria,8,395,2174
Proteobacteria,9,2651,767
Proteobacteria,10,1195,76
Proteobacteria,11,6857,795
Proteobacteria,12,483,666
Proteobacteria,13,2950,3994
Proteobacteria,14,1541,816
Proteobacteria,15,1307,53
Actinobacteria,1,569,648
Actinobacteria,2,1590,4
Actinobacteria,3,25,2
Actinobacteria,4,259,300
Actinobacteria,5,568,7
Actinobacteria,6,1102,9
Actinobacteria,7,678,377
Actinobacteria,8,260,58
Actinobacteria,9,424,233
Actinobacteria,10,548,21
Actinobacteria,11,201,83
Actinobacteria,12,42,75
Actinobacteria,13,109,59
Actinobacteria,14,51,183
Actinobacteria,15,310,204
Bacteroidetes,1,115,380
Bacteroidetes,2,67,0
Bacteroidetes,3,0,0
Bacteroidetes,4,85,5
Bacteroidetes,5,143,7
Bacteroidetes,6,678,2
Bacteroidetes,7,4829,209
Bacteroidetes,8,74,651
Bacteroidetes,9,169,254
Bacteroidetes,10,106,10
Bacteroidetes,11,73,381
Bacteroidetes,12,30,359
Bacteroidetes,13,51,51
Bacteroidetes,14,2473,2314
Bacteroidetes,15,102,33
Other,1,114,277
Other,2,195,18
Other,3,42,2
Other,4,316,43
Other,5,202,40
Other,6,116,0
Other,7,527,12
Other,8,357,11
Other,9,106,11
Other,10,67,14
Other,11,203,6
Other,12,392,6
Other,13,28,25
Other,14,12,22
Other,15,305,32

This table can be read into a DataFrame using read_csv:


In [59]:
mb = pd.read_csv("../data/microbiome/microbiome.csv")
mb.head()


Out[59]:
Taxon Patient Tissue Stool
0 Firmicutes 1 632 305
1 Firmicutes 2 136 4182
2 Firmicutes 3 1174 703
3 Firmicutes 4 408 3946
4 Firmicutes 5 831 8605

If we have sections of data that we do not wish to import (for example, known bad data), we can populate the skiprows argument:


In [60]:
pd.read_csv("../data/microbiome/microbiome.csv", skiprows=[3,4,6]).head()


Out[60]:
Taxon Patient Tissue Stool
0 Firmicutes 1 632 305
1 Firmicutes 2 136 4182
2 Firmicutes 5 831 8605
3 Firmicutes 7 718 717
4 Firmicutes 8 173 33

Conversely, if we only want to import a small number of rows from, say, a very large data file, we can use nrows:


In [61]:
few_recs = pd.read_csv("../data/microbiome/microbiome.csv", nrows=4)

few_recs


Out[61]:
Taxon Patient Tissue Stool
0 Firmicutes 1 632 305
1 Firmicutes 2 136 4182
2 Firmicutes 3 1174 703
3 Firmicutes 4 408 3946

Alternatively, if we want to process our data in reasonable chunks, the chunksize argument will return an iterable object that can be employed in a data processing loop. For example, our microbiome data are organized by bacterial phylum, with 15 patients represented in each:


In [62]:
data_chunks = pd.read_csv("../data/microbiome/microbiome.csv", chunksize=15)
data_chunks


Out[62]:
<pandas.io.parsers.TextFileReader at 0x10ff36898>
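
Iterating over this TextFileReader yields one DataFrame per chunk. For example, a sketch that simply inspects each chunk:

for chunk in pd.read_csv("../data/microbiome/microbiome.csv", chunksize=15):
    print(chunk.Taxon.iloc[0], chunk.shape)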

Exercise: Calculating summary statistics

Import the microbiome data, calculating the mean counts across all patients for each taxon, returning these values in a dictionary.

Hint: using chunksize makes this more efficient!


In [63]:
# Write your answer here

Hierarchical Indices

For a more useful index, we can specify the first two columns, which together provide a unique index to the data.


In [64]:
mb = pd.read_csv("../data/microbiome/microbiome.csv", index_col=['Taxon','Patient'])
mb.head()


Out[64]:
Tissue Stool
Taxon Patient
Firmicutes 1 632 305
2 136 4182
3 1174 703
4 408 3946
5 831 8605

This is called a hierarchical index, which allows multiple dimensions of data to be represented in tabular form.


In [65]:
mb.index


Out[65]:
MultiIndex(levels=[['Actinobacteria', 'Bacteroidetes', 'Firmicutes', 'Other', 'Proteobacteria'], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]],
           labels=[[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]],
           names=['Taxon', 'Patient'])

The corresponding index is a MultiIndex object that consists of a sequence of tuples, the elements of which are combinations of the two columns used to create the index. Where there are multiple repeated values, pandas does not print the repeats, making it easy to identify groups of values.

Rows can be indexed by passing the appropriate tuple.


In [66]:
mb.ix[('Firmicutes', 2)]


Out[66]:
Tissue     136
Stool     4182
Name: (Firmicutes, 2), dtype: int64

With a hierarchical index, we can select subsets of the data based on a partial index:


In [67]:
mb.ix['Proteobacteria']


Out[67]:
Tissue Stool
Patient
1 1638 3886
2 2469 1821
3 839 661
4 4414 18
5 12044 83
6 2310 12
7 3053 547
8 395 2174
9 2651 767
10 1195 76
11 6857 795
12 483 666
13 2950 3994
14 1541 816
15 1307 53

To extract arbitrary levels from a hierarchical row index, the cross-section method xs can be used.


In [68]:
mb.xs(1, level='Patient')


Out[68]:
Tissue Stool
Taxon
Firmicutes 632 305
Proteobacteria 1638 3886
Actinobacteria 569 648
Bacteroidetes 115 380
Other 114 277

We may also reorder levels as we like.


In [69]:
mb.swaplevel('Patient', 'Taxon').head()


Out[69]:
Tissue Stool
Patient Taxon
1 Firmicutes 632 305
2 Firmicutes 136 4182
3 Firmicutes 1174 703
4 Firmicutes 408 3946
5 Firmicutes 831 8605

Operations

DataFrame and Series objects allow for several operations to take place either on a single object, or between two or more objects.

For example, we can perform arithmetic on the elements of two objects, such as calculating the ratio of bacteria counts between locations:


In [70]:
mb.Stool / mb.Tissue


Out[70]:
Taxon           Patient
Firmicutes      1           0.482595
                2          30.750000
                3           0.598807
                4           9.671569
                5          10.354994
                6           0.072150
                7           0.998607
                8           0.190751
                9           0.350877
                10         19.728395
                11          0.086022
                12          1.024912
                13         15.579439
                14          2.322917
                15          8.459075
Proteobacteria  1           2.372405
                2           0.737546
                3           0.787843
                4           0.004078
                5           0.006891
                6           0.005195
                7           0.179168
                8           5.503797
                9           0.289325
                10          0.063598
                11          0.115940
                12          1.378882
                13          1.353898
                14          0.529526
                15          0.040551
                             ...    
Bacteroidetes   1           3.304348
                2           0.000000
                3                NaN
                4           0.058824
                5           0.048951
                6           0.002950
                7           0.043280
                8           8.797297
                9           1.502959
                10          0.094340
                11          5.219178
                12         11.966667
                13          1.000000
                14          0.935706
                15          0.323529
Other           1           2.429825
                2           0.092308
                3           0.047619
                4           0.136076
                5           0.198020
                6           0.000000
                7           0.022770
                8           0.030812
                9           0.103774
                10          0.208955
                11          0.029557
                12          0.015306
                13          0.892857
                14          1.833333
                15          0.104918
dtype: float64

Microsoft Excel

Since so much financial and scientific data ends up in Excel spreadsheets (regrettably), pandas' ability to directly import Excel spreadsheets is valuable. This support is contingent on having one or two dependencies (depending on what version of Excel file is being imported) installed: xlrd and openpyxl (these may be installed with either pip or easy_install).

Importing Excel data to pandas is a two-step process. First, we create an ExcelFile object using the path of the file:


In [71]:
mb_file = pd.ExcelFile('../data/microbiome/MID1.xls')
mb_file


Out[71]:
<pandas.io.excel.ExcelFile at 0x10fede358>

Then, since modern spreadsheets consist of one or more "sheets", we parse the sheet with the data of interest:


In [72]:
mb1 = mb_file.parse("Sheet 1", header=None)
mb1.columns = ["Taxon", "Count"]
mb1.head()


Out[72]:
Taxon Count
0 Archaea "Crenarchaeota" Thermoprotei Desulfuro... 7
1 Archaea "Crenarchaeota" Thermoprotei Desulfuro... 2
2 Archaea "Crenarchaeota" Thermoprotei Sulfoloba... 3
3 Archaea "Crenarchaeota" Thermoprotei Thermopro... 3
4 Archaea "Euryarchaeota" "Methanomicrobia" Meth... 7

There is now a read_excel convenience function in pandas that combines these steps into a single call:


In [73]:
mb2 = pd.read_excel('../data/microbiome/MID2.xls', sheetname='Sheet 1', header=None)
mb2.head()


Out[73]:
0 1
0 Archaea "Crenarchaeota" Thermoprotei Acidiloba... 2
1 Archaea "Crenarchaeota" Thermoprotei Acidiloba... 14
2 Archaea "Crenarchaeota" Thermoprotei Desulfuro... 23
3 Archaea "Crenarchaeota" Thermoprotei Desulfuro... 1
4 Archaea "Crenarchaeota" Thermoprotei Desulfuro... 2

Relational Databases

If you are fortunate, your data will be stored in a database (relational or non-relational) rather than in arbitrary text files or spreadsheets. Relational databases are particularly useful for storing large quantities of structured data, where fields are grouped together in tables according to their relationships with one another.

pandas' DataFrame interacts with relational (i.e. SQL) databases, and even provides facilities for using SQL syntax on the DataFrame itself, which we will get to later. For now, let's work with a ubiquitous embedded database called SQLite, which comes bundled with Python. A SQLite database can be queried with the standard library's sqlite3 module.


In [74]:
import sqlite3

query = '''
CREATE TABLE samples
(taxon VARCHAR(15), patient INTEGER, tissue INTEGER, stool INTEGER);
'''

This query string will create a table to hold some of our microbiome data, which we can execute after connecting to a database (which will be created, if it does not exist).


In [75]:
con = sqlite3.connect('microbiome.sqlite3')
con.execute(query)
con.commit()

In [76]:
few_recs.ix[0]


Out[76]:
Taxon      Firmicutes
Patient             1
Tissue            632
Stool             305
Name: 0, dtype: object

In [77]:
con.execute('INSERT INTO samples VALUES(\'{}\',{},{},{})'.format(*few_recs.ix[0]))


Out[77]:
<sqlite3.Cursor at 0x10f5c16c0>

In [78]:
query = 'INSERT INTO samples VALUES(?, ?, ?, ?)'
con.executemany(query, few_recs.values[1:])


Out[78]:
<sqlite3.Cursor at 0x10f5c1730>

In [79]:
con.commit()

Using SELECT queries, we can read from the database.


In [80]:
cursor = con.execute('SELECT * FROM samples')
rows = cursor.fetchall()

rows


Out[80]:
[('Firmicutes', 1, 632, 305),
 ('Firmicutes', 2, 136, 4182),
 ('Firmicutes', 3, 1174, 703),
 ('Firmicutes', 4, 408, 3946)]

These results can be passed directly to a DataFrame


In [81]:
pd.DataFrame(rows)


Out[81]:
0 1 2 3
0 Firmicutes 1 632 305
1 Firmicutes 2 136 4182
2 Firmicutes 3 1174 703
3 Firmicutes 4 408 3946

To obtain the column names, we can obtain the table information from the database, via the special PRAGMA statement.


In [82]:
table_info = con.execute('PRAGMA table_info(samples);').fetchall()

table_info


Out[82]:
[(0, 'taxon', 'VARCHAR(15)', 0, None, 0),
 (1, 'patient', 'INTEGER', 0, None, 0),
 (2, 'tissue', 'INTEGER', 0, None, 0),
 (3, 'stool', 'INTEGER', 0, None, 0)]

In [83]:
pd.DataFrame(rows, columns=np.transpose(table_info)[1])


Out[83]:
taxon patient tissue stool
0 Firmicutes 1 632 305
1 Firmicutes 2 136 4182
2 Firmicutes 3 1174 703
3 Firmicutes 4 408 3946

A more direct approach is to pass the query to the read_sql_query function, which returns a populated DataFrame.


In [84]:
pd.read_sql_query('SELECT * FROM samples', con)


Out[84]:
taxon patient tissue stool
0 Firmicutes 1 632 305
1 Firmicutes 2 136 4182
2 Firmicutes 3 1174 703
3 Firmicutes 4 408 3946

Correspondingly, we can append records into the database with to_sql.


In [85]:
more_recs = pd.read_csv("../data/microbiome/microbiome_missing.csv").head(20)

In [86]:
more_recs.to_sql('samples', con, if_exists='append', index=False)

In [87]:
cursor = con.execute('SELECT * FROM samples')
cursor.fetchall()


Out[87]:
[('Firmicutes', 1, 632, 305),
 ('Firmicutes', 2, 136, 4182),
 ('Firmicutes', 3, 1174, 703),
 ('Firmicutes', 4, 408, 3946),
 ('Firmicutes', 1, 632, 305),
 ('Firmicutes', 2, 136, 4182),
 ('Firmicutes', 3, None, 703),
 ('Firmicutes', 4, 408, 3946),
 ('Firmicutes', 5, 831, 8605),
 ('Firmicutes', 6, 693, 50),
 ('Firmicutes', 7, 718, 717),
 ('Firmicutes', 8, 173, 33),
 ('Firmicutes', 9, 228, None),
 ('Firmicutes', 10, 162, 3196),
 ('Firmicutes', 11, 372, -99999),
 ('Firmicutes', 12, 4255, 4361),
 ('Firmicutes', 13, 107, 1667),
 ('Firmicutes', 14, '?', 223),
 ('Firmicutes', 15, 281, 2377),
 ('Proteobacteria', 1, 1638, 3886),
 ('Proteobacteria', 2, 2469, 1821),
 ('Proteobacteria', 3, 839, 661),
 ('Proteobacteria', 4, 4414, 18),
 ('Proteobacteria', 5, 12044, 83)]

There are several other data formats that can be imported into Python and converted into DataFrames, with the help of built-in or third-party libraries. These include JSON, XML, HDF5, non-relational databases, and various web APIs.
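
For example, a JSON file laid out as an array of records could be read directly into a DataFrame (a sketch, assuming a hypothetical data.json file):

pd.read_json('data.json')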


In [88]:
# Get rid of the database we created
!rm microbiome.sqlite3

2014 Ebola Outbreak Data

The ../data/ebola folder contains summarized reports of Ebola cases from three countries during the recent outbreak of the disease in West Africa. For each country, there are daily reports that contain various information about the outbreak in several cities in each country.

From these data files, use pandas to import them and create a single data frame that includes the daily totals of new cases for each country.

We may use this compiled data for more advanced applications later in the course.

The data are taken from Caitlin Rivers' ebola GitHub repository, and are licensed for both commercial and non-commercial use. The tutorial repository contains a subset of this data from three countries (Sierra Leone, Liberia and Guinea) that we will use as an example. They reside in a nested subdirectory in the data directory.


In [89]:
ebola_dirs = !ls ../data/ebola/
ebola_dirs


Out[89]:
['LICENSE', 'guinea_data', 'liberia_data', 'sl_data']

Within each country directory, there are CSV files containing daily information regarding the state of the outbreak for that country. The first step is to efficiently import all the relevant files.

Our approach will be to construct a dictionary containing a list of filenames to import. We can use the glob package to identify all the CSV files in each directory. This can all be placed within a dictionary comprehension.


In [90]:
import glob

filenames = {data_dir[:data_dir.find('_')]: glob.glob('../data/ebola/{0}/*.csv'.format(data_dir)) for data_dir in ebola_dirs[1:]}

We are now in a position to iterate over the dictionary and import the corresponding files. However, the data layout of the files across the dataset is partially inconsistent.


In [91]:
pd.read_csv('../data/ebola/sl_data/2014-08-12-v77.csv').head()


Out[91]:
date variable Kailahun Kenema Kono Kambia Koinadugu Bombali Tonkolili Port Loko Pujehun Bo Moyamba Bonthe Western area urban Western area rural National
0 2014-08-12 population 465048 653013 325003 341690 335471 494139 434937 557978 335574 654142 278119 168729 1040888 263619 6348350
1 2014-08-12 new_noncase 0 3 0 0 0 0 0 1 0 0 0 0 0 0 4
2 2014-08-12 new_suspected 0 9 0 0 0 0 0 0 0 1 0 0 0 0 10
3 2014-08-12 new_probable 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1
4 2014-08-12 new_confirmed 0 9 0 0 0 0 0 2 0 0 0 0 0 0 11

In [92]:
pd.read_csv('../data/ebola/guinea_data/2014-09-02.csv').head()


Out[92]:
Date Description Totals Conakry Gueckedou Macenta Dabola Kissidougou Dinguiraye Telimele ... Mzerekore Yomou Dubreka Forecariah Kerouane Coyah Dalaba Beyla Kindia Lola
0 2014-09-02 New cases of suspects 11 NaN 5 6 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2014-09-02 New cases of probables 0 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2014-09-02 New cases of confirmed 14 NaN 4 6 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2014-09-02 Total new cases registered so far 25 0 9 12 0 0 0 0 ... 0 0 1 0 0 3 NaN NaN NaN NaN
4 2014-09-02 Total cases of suspects 49 15 5 17 0 0 0 0 ... 0 1 1 2 5 0 NaN NaN NaN NaN

5 rows × 24 columns

Clearly, we will need to develop row masks to extract the data we need across all files, without having to manually extract data from each file.

Let's hack at one file to develop the mask.


In [93]:
sample = pd.read_csv('../data/ebola/sl_data/2014-08-12-v77.csv')

To prevent issues with capitalization, we will simply revert all labels to lower case.


In [94]:
lower_vars = sample.variable.str.lower()

Since we are interested in extracting new cases only, we can use the string accessor attribute to look for key words that we would like to include or exclude.


In [95]:
case_mask = (lower_vars.str.contains('new') 
             & (lower_vars.str.contains('case') | lower_vars.str.contains('suspect')) 
             & ~lower_vars.str.contains('non')
             & ~lower_vars.str.contains('total'))

We could have instead used regular expressions to do the same thing.
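
For instance, a sketch of an equivalent mask (str.contains interprets its argument as a regular expression by default):

case_mask_re = (lower_vars.str.contains('new')
                & lower_vars.str.contains('case|suspect')
                & ~lower_vars.str.contains('non|total'))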

Finally, we are only interested in three columns.


In [96]:
sample.loc[case_mask, ['date', 'variable', 'National']]


Out[96]:
date variable National
2 2014-08-12 new_suspected 10

We can now embed this operation in a loop over all the filenames in the dictionary.


In [97]:
datasets = []
for country in filenames:
    
    country_files = filenames[country]
    for f in country_files:
        
        data = pd.read_csv(f)
        
        
        # Convert to lower case to avoid capitalization issues
        data.columns = data.columns.str.lower()
        # Column naming is inconsistent. These procedures deal with that.
        keep_columns = ['date']
        if 'description' in data.columns:
            keep_columns.append('description')
        else:
            keep_columns.append('variable')
            
        if 'totals' in data.columns:
            keep_columns.append('totals')
        else:
            keep_columns.append('national')
            
        # Index out the columns we need, and rename them
        keep_data = data[keep_columns]
        keep_data.columns = 'date', 'variable', 'totals'
        
        # Extract the rows we might want
        lower_vars = keep_data.variable.str.lower()
        # Of course we can also use regex to do this
        case_mask = (lower_vars.str.contains('new') 
                     & (lower_vars.str.contains('case') | lower_vars.str.contains('suspect') 
                                                        | lower_vars.str.contains('confirm')) 
                     & ~lower_vars.str.contains('non')
                     & ~lower_vars.str.contains('total'))
        
        keep_data = keep_data[case_mask].dropna()
        
        # Convert data types
        keep_data['date'] = pd.to_datetime(keep_data.date)
        keep_data['totals'] = keep_data.totals.astype(int)
        
        # Assign country label and append to datasets list
        datasets.append(keep_data.assign(country=country))

Now that we have a list populated with DataFrame objects for each day and country, we can call concat to concatenate them into a single DataFrame.


In [98]:
all_data = pd.concat(datasets)
all_data.head()


Out[98]:
date variable totals country
2 2014-08-12 new_suspected 10 sl
4 2014-08-12 new_confirmed 11 sl
2 2014-08-13 new_suspected 3 sl
4 2014-08-13 new_confirmed 15 sl
2 2014-08-14 new_suspected 0 sl

This works because the structure of each table was identical.

Manipulating indices

Notice from above, however, that the index contains redundant integer index values. We can confirm this:


In [99]:
all_data.index.is_unique


Out[99]:
False

We can create a new unique index by calling the reset_index method on the new data frame after we import it, which will generate a new ordered, unique index.


In [100]:
all_data = pd.concat(datasets).reset_index(drop=True)
all_data.head()


Out[100]:
date variable totals country
0 2014-08-12 new_suspected 10 sl
1 2014-08-12 new_confirmed 11 sl
2 2014-08-13 new_suspected 3 sl
3 2014-08-13 new_confirmed 15 sl
4 2014-08-14 new_suspected 0 sl

Reindexing allows users to manipulate the data labels in a DataFrame. It forces a DataFrame to conform to the new index, and optionally, fill in missing data if requested.

A simple use of reindex is to alter the order of the rows. For example, records are currently ordered first by country then by day, since this is the order in which they were iterated over and imported. We might arbitrarily want to reverse the order, which is performed by passing the appropriate index values to reindex.


In [101]:
all_data.reindex(all_data.index[::-1])


Out[101]:
date variable totals country
636 2014-12-09 New case/s (confirmed) 2946 liberia
635 2014-12-09 New Case/s (Probable) 1801 liberia
634 2014-12-09 New Case/s (Suspected) 3050 liberia
633 2014-12-08 New case/s (confirmed) 2927 liberia
632 2014-12-08 New Case/s (Probable) 1805 liberia
631 2014-12-08 New Case/s (Suspected) 3054 liberia
630 2014-12-07 New case/s (confirmed) 2869 liberia
629 2014-12-07 New Case/s (Probable) 1829 liberia
628 2014-12-07 New Case/s (Suspected) 3067 liberia
627 2014-12-06 New case/s (confirmed) 2869 liberia
626 2014-12-06 New Case/s (Probable) 1810 liberia
625 2014-12-06 New Case/s (Suspected) 3056 liberia
624 2014-12-05 New case/s (confirmed) 2867 liberia
623 2014-12-05 New Case/s (Probable) 1808 liberia
622 2014-12-05 New Case/s (Suspected) 3056 liberia
621 2014-12-04 New case/s (confirmed) 2867 liberia
620 2014-12-04 New Case/s (Probable) 1800 liberia
619 2014-12-04 New Case/s (Suspected) 3054 liberia
618 2014-12-02 New case/s (confirmed) 9 liberia
617 2014-12-02 New Case/s (Probable) 10 liberia
616 2014-12-02 New Case/s (Suspected) 18 liberia
615 2014-12-02 Newly Reported Cases in HCW 1 liberia
614 2014-12-01 New case/s (confirmed) 1 liberia
613 2014-12-01 New Case/s (Probable) 9 liberia
612 2014-12-01 New Case/s (Suspected) 25 liberia
611 2014-11-30 New case/s (confirmed) 10 liberia
610 2014-11-30 Newly Reported Cases in HCW 1 liberia
609 2014-11-29 New case/s (confirmed) 10 liberia
608 2014-11-29 New Case/s (Probable) 4 liberia
607 2014-11-29 New Case/s (Suspected) 7 liberia
... ... ... ... ...
29 2014-08-27 new_confirmed 27 sl
28 2014-08-27 new_suspected 3 sl
27 2014-08-25 new_confirmed 20 sl
26 2014-08-25 new_suspected 5 sl
25 2014-08-24 new_confirmed 31 sl
24 2014-08-24 new_suspected 0 sl
23 2014-08-23 new_confirmed 23 sl
22 2014-08-23 new_suspected 4 sl
21 2014-08-22 new_confirmed 56 sl
20 2014-08-22 new_suspected 1 sl
19 2014-08-21 new_confirmed 9 sl
18 2014-08-21 new_suspected 0 sl
17 2014-08-20 new_confirmed 4 sl
16 2014-08-20 new_suspected 1 sl
15 2014-08-19 new_confirmed 9 sl
14 2014-08-19 new_suspected 16 sl
13 2014-08-18 new_confirmed 5 sl
12 2014-08-18 new_suspected 40 sl
11 2014-08-17 new_confirmed 2 sl
10 2014-08-17 new_suspected 1 sl
9 2014-08-16 new_confirmed 18 sl
8 2014-08-16 new_suspected 3 sl
7 2014-08-15 new_confirmed 10 sl
6 2014-08-15 new_suspected 6 sl
5 2014-08-14 new_confirmed 13 sl
4 2014-08-14 new_suspected 0 sl
3 2014-08-13 new_confirmed 15 sl
2 2014-08-13 new_suspected 3 sl
1 2014-08-12 new_confirmed 11 sl
0 2014-08-12 new_suspected 10 sl

637 rows × 4 columns

Notice that the reindexing operation is not performed "in-place"; the original DataFrame remains as it was, and the method returns a copy of the DataFrame with the new index. This is a common trait for pandas, and is a Good Thing.

We may also wish to reorder the columns this way.


In [102]:
all_data.reindex(columns=['date', 'country', 'variable', 'totals']).head()


Out[102]:
date country variable totals
0 2014-08-12 sl new_suspected 10
1 2014-08-12 sl new_confirmed 11
2 2014-08-13 sl new_suspected 3
3 2014-08-13 sl new_confirmed 15
4 2014-08-14 sl new_suspected 0

Group by operations

One of pandas' most powerful features is the ability to perform operations on subgroups of a DataFrame. These so-called group by operations define subunits of the dataset according to the values of one or more variables in the DataFrame.

For this data, we want to sum the new case counts by day and country; so we pass these two column names to the groupby method, then sum the totals column across them.


In [103]:
all_data_grouped = all_data.groupby(['country', 'date'])
daily_cases = all_data_grouped['totals'].sum()
daily_cases.head(10)


Out[103]:
country  date      
guinea   2014-08-04    11
         2014-08-26    28
         2014-08-27    22
         2014-08-30    24
         2014-08-31    46
         2014-09-02    25
         2014-09-04    30
         2014-09-07    18
         2014-09-08    18
         2014-09-09    16
Name: totals, dtype: int64

The resulting series retains a hierarchical index from the group by operation. Hence, we can index out the counts for a given country on a particular day by indexing with the appropriate tuple.


In [104]:
daily_cases[('liberia', '2014-09-02')]


Out[104]:
liberia                78
2014-09-02 00:00:00    78
Name: totals, dtype: int64

One issue with the data we have extracted is that there appear to be serious outliers in the Liberian counts. The values are much too large to be a daily count, even during a serious outbreak.


In [105]:
daily_cases.sort(ascending=False)
daily_cases.head(10)


Out[105]:
country  date      
liberia  2014-12-09    7797
         2014-12-08    7786
         2014-12-07    7765
         2014-12-06    7735
         2014-12-05    7731
         2014-12-04    7721
         2014-10-17     167
sl       2014-11-08     131
         2014-11-20     130
         2014-11-10     126
Name: totals, dtype: int64

We can filter these outliers using an appropriate threshold.


In [106]:
daily_cases = daily_cases[daily_cases<200]

Plotting

pandas data structures have high-level methods for creating a variety of plots, which tends to be easier than generating the corresponding plot using matplotlib.

For example, we may want to create a plot of the cumulative cases for each of the three countries. The easiest way to do this is to remove the hierarchical index, and create a DataFrame of three columns, which will result in three lines when plotted.

First, call unstack to remove the hierarchical index:


In [107]:
daily_cases.unstack().head()


Out[107]:
date 2014-06-16 00:00:00 2014-06-17 00:00:00 2014-06-22 00:00:00 2014-06-24 00:00:00 2014-06-25 00:00:00 2014-06-28 00:00:00 2014-06-29 00:00:00 2014-07-01 00:00:00 2014-07-02 00:00:00 2014-07-03 00:00:00 ... 2014-11-24 00:00:00 2014-11-26 00:00:00 2014-11-27 00:00:00 2014-11-28 00:00:00 2014-11-29 00:00:00 2014-11-30 00:00:00 2014-12-01 00:00:00 2014-12-02 00:00:00 2014-12-04 00:00:00 2014-12-05 00:00:00
country
guinea NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
liberia 4 2 14 6 9 10 3 4 4 4 ... 25 31 46 7 21 11 35 38 NaN NaN
sl NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 115 NaN NaN 110 88 NaN 86 NaN 41 78

3 rows × 128 columns

Next, transpose the resulting DataFrame to swap the rows and columns.


In [108]:
daily_cases.unstack().T.head()


Out[108]:
country guinea liberia sl
date
2014-06-16 NaN 4 NaN
2014-06-17 NaN 2 NaN
2014-06-22 NaN 14 NaN
2014-06-24 NaN 6 NaN
2014-06-25 NaN 9 NaN

Since we have missing values for some dates, we will assume that the counts for those days were zero (the actual counts for that day may have been included in the next reporting day's data).


In [109]:
daily_cases.unstack().T.fillna(0).head()


Out[109]:
country guinea liberia sl
date
2014-06-16 0 4 0
2014-06-17 0 2 0
2014-06-22 0 14 0
2014-06-24 0 6 0
2014-06-25 0 9 0

Finally, calculate the cumulative sum for all the columns, and generate a line plot, which we get by default.


In [110]:
daily_cases.unstack().T.fillna(0).cumsum().plot()


Out[110]:
<matplotlib.axes._subplots.AxesSubplot at 0x110235dd8>

Resampling

An alternative to filling days without case reports with zeros is to aggregate the data at a coarser time scale. New cases are often reported by week; we can use the resample method to summarize the data into weekly values.


In [111]:
weekly_cases = daily_cases.unstack().T.resample('W', how='sum')
weekly_cases


Out[111]:
country guinea liberia sl
date
2014-06-22 NaN 20 NaN
2014-06-29 NaN 28 NaN
2014-07-06 NaN 12 NaN
2014-07-13 NaN 18 NaN
2014-07-20 NaN 15 NaN
2014-07-27 NaN 56 NaN
2014-08-03 NaN 11 NaN
2014-08-10 11 10 NaN
2014-08-17 NaN 106 92
2014-08-24 NaN 122 199
2014-08-31 120 94 189
2014-09-07 73 402 141
2014-09-14 83 436 216
2014-09-21 55 404 301
2014-09-28 95 275 408
2014-10-05 51 207 368
2014-10-12 NaN 248 409
2014-10-19 NaN 330 496
2014-10-26 NaN 220 452
2014-11-02 NaN 228 499
2014-11-09 NaN 54 236
2014-11-16 NaN 60 449
2014-11-23 NaN 120 434
2014-11-30 NaN 141 313
2014-12-07 NaN 73 205
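
Note that the how argument to resample has been deprecated in more recent pandas releases, where the aggregation is instead chained onto the resampler:

# equivalent spelling in recent pandas versions
weekly_cases = daily_cases.unstack().T.resample('W').sum()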

In [112]:
weekly_cases.cumsum().plot()


Out[112]:
<matplotlib.axes._subplots.AxesSubplot at 0x1103cf7f0>

Writing Data to Files

As well as being able to read several input formats, pandas can export data to a variety of storage formats. We will draw your attention to just one of these; the usage is similar across formats.


In [113]:
medals_data.to_csv("../data/medals.csv", index=False)

The to_csv method writes a DataFrame to a comma-separated values (CSV) file. You can specify custom delimiters (via the sep argument), how missing values are represented (via the na_rep argument), whether the index is written (via the index argument), and whether the header is included (via the header argument), among other options.
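
As a minimal sketch combining several of these options (the tab-separated output path is hypothetical):

medals_data.to_csv("../data/medals.tsv", sep="\t", na_rep="NULL",
                   index=False, header=True)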

Missing data

The occurrence of missing data is so prevalent that it pays to use tools like pandas, which integrates missing data handling so that it can be dealt with easily, in the manner required by the analysis at hand.

Missing data are represented in Series and DataFrame objects by the NaN floating-point value. The Python object None is also treated as missing, since it is commonly used to denote missing values in other contexts.
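
For example, a minimal sketch using constructed data:

vals = pd.Series([1, None, np.nan])
vals.isnull()   # True for both the None and the NaN entries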


In [114]:
!head -n 20 ../data/microbiome/microbiome_missing.csv


Taxon,Patient,Tissue,Stool
Firmicutes,1,632,305
Firmicutes,2,136,4182
Firmicutes,3,,703
Firmicutes,4,408,3946
Firmicutes,5,831,8605
Firmicutes,6,693,50
Firmicutes,7,718,717
Firmicutes,8,173,33
Firmicutes,9,228,NA
Firmicutes,10,162,3196
Firmicutes,11,372,-99999
Firmicutes,12,4255,4361
Firmicutes,13,107,1667
Firmicutes,14,?,223
Firmicutes,15,281,2377
Proteobacteria,1,1638,3886
Proteobacteria,2,2469,1821
Proteobacteria,3,839,661
Proteobacteria,4,4414,18

In [115]:
pd.read_csv("../data/microbiome/microbiome_missing.csv").head(20)


Out[115]:
Taxon Patient Tissue Stool
0 Firmicutes 1 632 305
1 Firmicutes 2 136 4182
2 Firmicutes 3 NaN 703
3 Firmicutes 4 408 3946
4 Firmicutes 5 831 8605
5 Firmicutes 6 693 50
6 Firmicutes 7 718 717
7 Firmicutes 8 173 33
8 Firmicutes 9 228 NaN
9 Firmicutes 10 162 3196
10 Firmicutes 11 372 -99999
11 Firmicutes 12 4255 4361
12 Firmicutes 13 107 1667
13 Firmicutes 14 ? 223
14 Firmicutes 15 281 2377
15 Proteobacteria 1 1638 3886
16 Proteobacteria 2 2469 1821
17 Proteobacteria 3 839 661
18 Proteobacteria 4 4414 18
19 Proteobacteria 5 12044 83

Above, pandas recognized NA and an empty field as missing data.


In [116]:
pd.isnull(pd.read_csv("../data/microbiome/microbiome_missing.csv")).head(20)


Out[116]:
Taxon Patient Tissue Stool
0 False False False False
1 False False False False
2 False False True False
3 False False False False
4 False False False False
5 False False False False
6 False False False False
7 False False False False
8 False False False True
9 False False False False
10 False False False False
11 False False False False
12 False False False False
13 False False False False
14 False False False False
15 False False False False
16 False False False False
17 False False False False
18 False False False False
19 False False False False

Unfortunately, conventions for encoding missing data are sometimes inconsistent. In this example, a question mark "?" and a large negative number appear where there should have been positive integers. We can flag such additional values as missing with the na_values argument:


In [117]:
missing_sample = pd.read_csv("../data/microbiome/microbiome_missing.csv", 
                             na_values=['?', -99999], nrows=20)

missing_sample


Out[117]:
Taxon Patient Tissue Stool
0 Firmicutes 1 632 305
1 Firmicutes 2 136 4182
2 Firmicutes 3 NaN 703
3 Firmicutes 4 408 3946
4 Firmicutes 5 831 8605
5 Firmicutes 6 693 50
6 Firmicutes 7 718 717
7 Firmicutes 8 173 33
8 Firmicutes 9 228 NaN
9 Firmicutes 10 162 3196
10 Firmicutes 11 372 NaN
11 Firmicutes 12 4255 4361
12 Firmicutes 13 107 1667
13 Firmicutes 14 NaN 223
14 Firmicutes 15 281 2377
15 Proteobacteria 1 1638 3886
16 Proteobacteria 2 2469 1821
17 Proteobacteria 3 839 661
18 Proteobacteria 4 4414 18
19 Proteobacteria 5 12044 83

These can also be specified on a per-column basis by passing a dict as the na_values argument, as in the sketch below.
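
A minimal sketch, where the column-to-values mapping is our own illustration:

missing_sample = pd.read_csv("../data/microbiome/microbiome_missing.csv",
                             na_values={'Tissue': ['?'], 'Stool': [-99999]},
                             nrows=20)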

By default, dropna drops entire rows in which one or more values are missing.


In [118]:
missing_sample.dropna()


Out[118]:
Taxon Patient Tissue Stool
0 Firmicutes 1 632 305
1 Firmicutes 2 136 4182
3 Firmicutes 4 408 3946
4 Firmicutes 5 831 8605
5 Firmicutes 6 693 50
6 Firmicutes 7 718 717
7 Firmicutes 8 173 33
9 Firmicutes 10 162 3196
11 Firmicutes 12 4255 4361
12 Firmicutes 13 107 1667
14 Firmicutes 15 281 2377
15 Proteobacteria 1 1638 3886
16 Proteobacteria 2 2469 1821
17 Proteobacteria 3 839 661
18 Proteobacteria 4 4414 18
19 Proteobacteria 5 12044 83

If we want to drop missing values column-wise instead of row-wise, we use axis=1.


In [119]:
missing_sample.dropna(axis=1)


Out[119]:
Taxon Patient
0 Firmicutes 1
1 Firmicutes 2
2 Firmicutes 3
3 Firmicutes 4
4 Firmicutes 5
5 Firmicutes 6
6 Firmicutes 7
7 Firmicutes 8
8 Firmicutes 9
9 Firmicutes 10
10 Firmicutes 11
11 Firmicutes 12
12 Firmicutes 13
13 Firmicutes 14
14 Firmicutes 15
15 Proteobacteria 1
16 Proteobacteria 2
17 Proteobacteria 3
18 Proteobacteria 4
19 Proteobacteria 5
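
dropna also accepts a how argument, which with how='all' drops a row (or column) only when every value is missing, and a thresh argument, which specifies the minimum number of non-missing values required to keep it. A minimal sketch:

missing_sample.dropna(how='all')   # drop rows in which all values are missing
missing_sample.dropna(thresh=4)    # keep rows with at least 4 non-missing values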

Rather than omitting missing data from an analysis, in some cases it may be suitable to fill the missing values in, whether with a default value (such as zero), a sentinel value, or a value imputed or carried forward/backward from similar data points. We can do this programmatically in pandas with the fillna method.


In [120]:
missing_sample.fillna(-999)


Out[120]:
Taxon Patient Tissue Stool
0 Firmicutes 1 632 305
1 Firmicutes 2 136 4182
2 Firmicutes 3 -999 703
3 Firmicutes 4 408 3946
4 Firmicutes 5 831 8605
5 Firmicutes 6 693 50
6 Firmicutes 7 718 717
7 Firmicutes 8 173 33
8 Firmicutes 9 228 -999
9 Firmicutes 10 162 3196
10 Firmicutes 11 372 -999
11 Firmicutes 12 4255 4361
12 Firmicutes 13 107 1667
13 Firmicutes 14 -999 223
14 Firmicutes 15 281 2377
15 Proteobacteria 1 1638 3886
16 Proteobacteria 2 2469 1821
17 Proteobacteria 3 839 661
18 Proteobacteria 4 4414 18
19 Proteobacteria 5 12044 83
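
To carry values forward or backward rather than filling with a constant, fillna accepts a method argument. A minimal sketch:

missing_sample.fillna(method='ffill')   # propagate the last valid value forward
missing_sample.fillna(method='bfill')   # fill from the next valid value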

Sentinel values can be useful in pandas because NaN is a floating-point value, making it impossible to store explicit missing values in an integer-typed column. Using some extreme (positive or negative) integer as a sentinel allows the column to remain integer-typed.
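
A minimal sketch of this upcasting, using constructed data:

pd.Series([1, 2, 3]).dtype      # dtype('int64')
pd.Series([1, None, 3]).dtype   # dtype('float64'), because NaN is a float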

Exercise: Mean imputation

Fill the missing values in missing_sample with the mean count for the corresponding taxon across patients.


In [121]:
## Write your answer here